Add MPI to driver code #34
Conversation
All changes were made in the driver code, and no changes should be required in the different implementations. The changes involved:
1. Initialising MPI and getting the rank and size.
2. Guarding std::cout so only rank 0 prints.
3. Adding barriers between kernels.
4. Performing a reduction for the dot product.
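A minimal sketch of what these four driver changes could look like, assuming a simplified driver (the variable names and structure here are illustrative, not the actual BabelStream code):

#include <mpi.h>
#include <iostream>

// Illustrative only: the shape of the four changes described above.
int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);                  // 1. Initialise MPI ...
  int rank = 0, size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);    //    ... and get the rank
  MPI_Comm_size(MPI_COMM_WORLD, &size);    //    ... and the size

  if (rank == 0)                           // 2. Guard output so only rank 0 prints
    std::cout << "Running on " << size << " MPI ranks" << std::endl;

  // ... run a kernel on this rank's local arrays ...
  MPI_Barrier(MPI_COMM_WORLD);             // 3. Barrier between kernels so ranks stay in step

  double local_dot = 0.0;                  // result of this rank's dot kernel
  double global_dot = 0.0;
  MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);           // 4. Dot product performs a reduction across ranks

  MPI_Finalize();
  return 0;
}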
Looks fine overall. Successfully ran across 256 nodes with decent efficiency.
I've made some minor suggestions w.r.t. output, but nothing particularly important.
What are we going to do about the Makefiles? Do we want some standard way of enabling MPI? For OpenMP you can do it without touching the Makefile, but other models like CUDA need some work.
I can imagine a couple of possible ways of doing this. There could be some variable the user can set:
make -f OpenMP.make MPI=1
Or we could have a special target in each Makefile:
make -f OpenMP.make mpi
Or something else. Not really sure what's best, just brainstorming here.
main.cpp
Outdated
std::streamsize ss = std::cout.precision();
std::cout << std::setprecision(1) << std::fixed
  << "Array size: " << ARRAY_SIZE*sizeof(T)*1.0E-6 << " MB"
  << " (=" << ARRAY_SIZE*sizeof(T)*1.0E-9 << " GB)" << std::endl;
std::cout << "Total size: " << 3.0*ARRAY_SIZE*sizeof(T)*1.0E-6 << " MB"
  << " (=" << 3.0*ARRAY_SIZE*sizeof(T)*1.0E-9 << " GB)" << std::endl;
std::cout.precision(ss);
}
Maybe worth including the phrase "per rank" in these printouts when in MPI mode, to make things crystal clear?
Could also show the actual total size, but not sure how interesting that is.
Is there a nice way to alter the string depending on the USE_MPI preprocessor define?
Not really; I think this is about as good as it can get:
std::cout << std::setprecision(1) << std::fixed
#ifdef USE_MPI
<< "Array size (per MPI rank): "
#else
<< "Array size: "
#endif
<< "...";
Fixed in cb92f94
@@ -179,6 +237,10 @@ void run()
check_solution<T>(num_times, a, b, c, sum);
I think this needs to be in an #ifdef USE_MPI / if (rank == 0) section as well - otherwise you get massively spammed if verification fails.
On a related note, we really need to sort out verification for the reduction as per #20, since this almost always fails now when using a large number of MPI ranks.
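Presumably the suggestion amounts to something like the following (a sketch only; rank, the arrays, and check_solution are assumed to exist in the surrounding driver code):

// Sketch of the suggested guard: only rank 0 checks and reports, so a
// verification failure is not printed once per MPI rank.
#ifdef USE_MPI
  if (rank == 0)
#endif
  {
    check_solution<T>(num_times, a, b, c, sum);
  }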
@@ -187,6 +249,7 @@ void run()
  << std::left << std::setw(12) << "Average" << std::endl;

  std::cout << std::fixed;
}
This is fine, but might be interesting to see these results per rank as well as the aggregate bandwidth. This would make it easier to see how well things scale when running on lots of nodes.
One could always run the benchmark with -np 1 (or run the non-MPI benchmark). We could also run the benchmark on a single node first, and then print out the efficiency. This would probably need a bit of an overhaul in the way the driver works though. Not sure if it's worth the pain.
I just meant that it might be nice to print out the average per-rank bandwidths, I'm not talking about actually re-running on a single node.
e.g. If I run on a single P100 I get 550 GB/s. When I run on a cluster of 2000 P100s, I get some very large number, but do I still get 550 GB/s per rank?
To be honest I'm just being lazy - it's trivial for the user to open a calculator and do that division themselves. Feel free to ignore this comment :-)
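Purely as an illustration of the kind of printout being discussed, a hypothetical helper could do that division in the driver (the names and structure here are assumed, not part of the PR):

#include <iostream>

// Hypothetical: given the bytes one rank moves per iteration and its fastest
// time, print the aggregate bandwidth alongside the per-rank figure.
void print_bandwidth(double bytes_per_rank, double min_time, int num_ranks)
{
  const double per_rank_gbps  = 1.0E-9 * bytes_per_rank / min_time;
  const double aggregate_gbps = per_rank_gbps * num_ranks; // weak scaling: each rank owns its own arrays
  std::cout << aggregate_gbps << " GB/s total ("
            << per_rank_gbps << " GB/s per rank)" << std::endl;
}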
- double average = std::accumulate(timings[i].begin()+1, timings[i].end(), 0.0) / (double)(num_times - 1);
+ double average = std::accumulate(timings[i].begin()+1, timings[i].end(), 0.0);

+ #ifdef USE_MPI
Indentation inside this block looks a little off.
Fixed in a57fdec
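The body of that #ifdef USE_MPI block is collapsed in this view. Purely as a hedged guess at the kind of reduction it might contain (not necessarily what the commit actually does), the per-rank sums of times could be combined before the average is formed:

#ifdef USE_MPI
  // Guess only: combine the per-rank sums of iteration times across ranks,
  // leaving the combined value on rank 0.
  double global_sum = 0.0;
  MPI_Reduce(&average, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  average = global_sum / (double)size;      // mean per-rank sum of times
#endif
  average /= (double)(num_times - 1);       // mean time per iteration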
Re the Makefile changes: I guess the ideal solution will depend on what changes are required.
Prevents lots of output in case of failure
ef22cbc implements this. Does this approach look OK? If so, I'll start to roll out similar changes to the other models.
Actually, piggybacking on
Yeah, looks fine to me, bar the
Closing, as we'd prefer a standalone MPI implementation. Opening a new issue to this effect.
Add the ability to run BabelStream across multiple nodes with MPI. The input size is assumed to be per MPI rank, so this behaves like weak scaling. The reported bandwidth is that of the entire parallel run.